[XPU] add build_sampling_params op. #7738
Conversation
Thanks for your contribution!
Pull request overview
This PR adds a build_sampling_params custom op for the XPU backend, replacing the previous Python sampling-parameter padding logic with an XPU kernel, and moves the infer_seed update into the op to align with the GPU seed-stepping strategy (especially under speculative decoding).
Changes:
- New XPU build_sampling_params kernel + plugin wrapper + Paddle static op, wired into the XPU speculative verify (TARGET_MATCH) path.
- The XPU ModelRunner now computes increment_value (aligned with GPU: 4 when not speculative, (num_speculative_tokens + 1) * 4 when speculative) and adjusts when infer_seed is updated; see the sketch after this list.
- New unit test custom_ops/xpu_ops/test/test_build_sampling_params.py, which checks against the Python reference implementation and covers multiple batch shapes plus seed wrap-around.
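A minimal sketch of the stepping rule from the second bullet; the helper name here is illustrative (the PR stores this value as an attribute on the XPU ModelRunner):

```python
def seed_increment(speculative_decoding: bool, num_speculative_tokens: int) -> int:
    # GPU-aligned infer_seed stride: 4 per sampled token; a speculative step
    # can emit up to num_speculative_tokens + 1 tokens, hence (N + 1) * 4.
    return 4 if not speculative_decoding else (num_speculative_tokens + 1) * 4
```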
PR metadata check (needs completion)
- The title already carries the [XPU] tag and follows the required format.
- The "Modifications / Usage or Command / Accuracy Tests" sections of the description are not filled in. If this op can affect sampling results or reproducibility, add an accuracy comparison plus the corresponding run commands and environment info; if unit tests are omitted or XPU CI cannot run, state the reason (this PR does add a unit test file, but the description should still explain how to run it).
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| fastdeploy/worker/xpu_model_runner.py | Computes and passes down increment_value; adjusts the infer_seed update logic for the speculative case |
| fastdeploy/model_executor/layers/sample/sampler.py | The XPU verify (TARGET_MATCH) path now uses build_sampling_params and passes increment_value through |
| custom_ops/xpu_ops/test/test_build_sampling_params.py | New XPU op unit test, validated against the Python reference implementation |
| custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp | New plugin wrapper (CPU + XPU3 dispatch) |
| custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu | New Kunlun3 XPU kernel implementation |
| custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h | Exports the build_sampling_params declaration |
| custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc | New Paddle static op registration and call bridging |
```python
# 7. Updata 'infer_seed' and step_paddle()
self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
if not self.speculative_decoding:
```

```python
share_inputs["seq_lens_this_time"],
share_inputs["seq_lens_encoder"],
token_num_output_cpu=int(share_inputs["cu_seqlens_q_output"][-1]),
increment_value=increment_value,
```

```cpp
api::Context* ctx = xpu_ctx->x_context();
if (top_p.is_cpu()) {
  ctx = new api::Context(api::kCPU);
```

```cpp
// A global scratch area for per-batch start offsets is not available here,
// so each cluster computes its own pad_start with a sequential scan over
// seq_lens_this_time / seq_lens_encoder on its core 0. Because clusters run
// concurrently we cannot share a global accumulator; instead each cluster
// independently sums the first `bi` entries. This is O(bs) per cluster, but
// bs is typically small (<=512).
```
CI report generated from the code below (refreshed every 30 minutes):
1. Task overview: 1 Required task failed.
2. Task status summary
2.1 Required tasks: 7/10 passed
2.2 Optional tasks: 27/32 passed
3. Failure details (required only): Approval: PR workflow (confidence: high)
Root cause / key logs / suggested fix: request approval from FastDeploy RDs such as @qingqing01 and PaddlePaddle RDs such as @jeff41404. Related change: PR title.
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
```
@@            Coverage Diff            @@
##           develop    #7738   +/-   ##
==========================================
  Coverage         ?   63.15%
==========================================
  Files            ?      461
  Lines            ?    64129
  Branches         ?     9824
==========================================
  Hits             ?    40501
  Misses           ?    20852
  Partials         ?     2776
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
```python
self.increment_value = (
    4 if not self.speculative_decoding else (self.speculative_config.num_speculative_tokens + 1) * 4
)
```
```
RequestFuncOutput(no=2347, request_id='None', generated_text='', reasoning_content='', success=False, latency=0.0, end_timestamp=0.0, output_tokens=0, ttft=0.0, arrival_time=[], itl=[], tpot=0.0, prompt_len=0, prompt_tokens=0, reasoning_tokens=0, res_ttft=0, error='{"error":{"message":"request[chatcmpl-814e8d96-3da8-46b0-b4da-31925c313041] generator error: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192), Traceback (most recent call last):\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/openai/serving_chat.py\\", line 168, in create_chat_completion\\n prompt_token_ids = await self.engine_client.format_and_add_data(current_req_dict)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 300, in format_and_add_data\\n await self.add_requests(request)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 390, in add_requests\\n raise EngineError(error_msg, error_code=400)\\nfastdeploy.utils.EngineError: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192)\\n","type":"invalid_request_error","param":null,"code":null}}', metrics={}, tool_calls=[], output_ids=[])
RequestFuncOutput(no=2347, request_id='None', generated_text='', reasoning_content='', success=False, latency=0.0, end_timestamp=0.0, output_tokens=0, ttft=0.0, arrival_time=[], itl=[], tpot=0.0, prompt_len=0, prompt_tokens=0, reasoning_tokens=0, res_ttft=0, error='{"error":{"message":"request[chatcmpl-799cdf97-ab7e-4823-80e4-1833bf5f7d90] generator error: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192), Traceback (most recent call last):\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/openai/serving_chat.py\\", line 168, in create_chat_completion\\n prompt_token_ids = await self.engine_client.format_and_add_data(current_req_dict)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 300, in format_and_add_data\\n await self.add_requests(request)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 390, in add_requests\\n raise EngineError(error_msg, error_code=400)\\nfastdeploy.utils.EngineError: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192)\\n","type":"invalid_request_error","param":null,"code":null}}', metrics={}, tool_calls=[], output_ids=[])
```
Force-pushed from 651d7cb to cfc5936
```diff
 _, next_tokens = top_k_top_p_sampling(
     probs,
-    top_p=top_p,
-    top_k=top_k,
+    top_p=sampling_metadata.top_p,
+    top_k=sampling_metadata.top_k,
     top_k_list=sampling_metadata.top_k_list,
-    topp_seed=topp_seed,
+    topp_seed=sampling_metadata.topp_seed,
 )
```
```diff
     sampling_metadata.seed,
-    paddle.reshape(share_inputs["seq_lens_this_time"], shape=[-1]),
-    paddle.reshape(share_inputs["seq_lens_encoder"], shape=[-1]),
+    share_inputs["seq_lens_this_time"],
+    share_inputs["seq_lens_encoder"],
+    token_num_output_cpu=int(share_inputs["cu_seqlens_q_output"][-1]),
+    increment_value=increment_value,
 )
```
```diff
-        self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
+        if not self.speculative_decoding:
+            self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
+            self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
```
```cpp
int64_t pad_idx = 0;
for (int bi = 0; bi < bs; bi++) {
  bool is_decoder = (seq_lens_encoder[bi] == 0);
  int repeat = is_decoder ? seq_lens_this_time[bi] : 1;
  int64_t bi_seed = infer_seed[bi];
  for (int local_pos = 0; local_pos < repeat; local_pos++) {
    int64_t offset = is_decoder ? static_cast<int64_t>(local_pos) * 4 : 0LL;
    top_p_padding[pad_idx] = top_p[bi];
    top_k_padding[pad_idx] = top_k[bi];
    topp_seed[pad_idx] = (bi_seed + offset) % BUILD_SAMPLING_MAX_INFER_SEED;
    pad_idx++;
  }
  infer_seed[bi] =
      (infer_seed[bi] + increment_value) % BUILD_SAMPLING_MAX_INFER_SEED;
}
```
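For readability, a NumPy transcription of the same reference loop; the function name and signature here are illustrative, not the repo's actual padding_sampling_params:

```python
import numpy as np

def build_sampling_params_ref(top_p, top_k, infer_seed,
                              seq_lens_this_time, seq_lens_encoder,
                              increment_value, max_infer_seed):
    # Mirrors the CPU wrapper above: decoder batches (seq_lens_encoder == 0)
    # emit one padded entry per token with the seed stepped by 4 per position;
    # encoder batches emit a single entry with an unstepped seed.
    top_p_pad, top_k_pad, topp_seed = [], [], []
    for bi in range(len(seq_lens_this_time)):
        is_decoder = seq_lens_encoder[bi] == 0
        repeat = int(seq_lens_this_time[bi]) if is_decoder else 1
        for local_pos in range(repeat):
            offset = local_pos * 4 if is_decoder else 0
            top_p_pad.append(top_p[bi])
            top_k_pad.append(top_k[bi])
            topp_seed.append((int(infer_seed[bi]) + offset) % max_infer_seed)
        # In-place seed step, matching the kernel's post-loop update.
        infer_seed[bi] = (int(infer_seed[bi]) + increment_value) % max_infer_seed
    return (np.array(top_p_pad), np.array(top_k_pad),
            np.array(topp_seed, dtype=np.int64))
```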
```cpp
// A global scratch area for per-batch start offsets is not available here,
// so each cluster computes its own pad_start with a sequential scan over
// seq_lens_this_time / seq_lens_encoder on its core 0. Because clusters run
// concurrently we cannot share a global accumulator; instead each cluster
// independently sums the first `bi` entries. This is O(bs) per cluster, but
// bs is typically small (<=512).
for (int bi = clusterid; bi < bs; bi += nclusters) {
  if (cid == 0) {
    // Read per-batch parameters from global memory.
    float lm_top_p;
    int64_t lm_top_k;
    int64_t lm_seed;
    int lm_slt;  // seq_lens_this_time[bi]
    int lm_sle;  // seq_lens_encoder[bi]

    GM2LM_ASYNC(top_p + bi, &lm_top_p, sizeof(float));
    GM2LM_ASYNC(top_k + bi, &lm_top_k, sizeof(int64_t));
    GM2LM_ASYNC(infer_seed + bi, &lm_seed, sizeof(int64_t));
    GM2LM_ASYNC(seq_lens_this_time + bi, &lm_slt, sizeof(int));
    GM2LM(seq_lens_encoder + bi, &lm_sle, sizeof(int));  // sync barrier

    bool is_decoder = (lm_sle == 0);
    int repeat = is_decoder ? lm_slt : 1;

    // Compute pad_start = sum of token counts for batches [0, bi).
    int pad_start = 0;
    for (int k = 0; k < bi; k++) {
      int slt_k, sle_k;
      GM2LM_ASYNC(seq_lens_this_time + k, &slt_k, sizeof(int));
      GM2LM(seq_lens_encoder + k, &sle_k, sizeof(int));
      pad_start += (sle_k == 0) ? slt_k : 1;
    }
```
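The pad_start each cluster derives is just an exclusive prefix sum of per-batch token counts. A host-side sketch of the same quantity (values are illustrative):

```python
import numpy as np

seq_lens_this_time = np.array([3, 4, 5], dtype=np.int32)
seq_lens_encoder = np.array([0, 128, 0], dtype=np.int32)  # > 0 marks an encoder (prefill) batch
# Decoder batches contribute seq_lens_this_time tokens; encoder batches contribute 1.
token_counts = np.where(seq_lens_encoder == 0, seq_lens_this_time, 1)
pad_start = np.concatenate(([0], np.cumsum(token_counts)[:-1]))  # -> [0, 3, 4]
```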
| """Normal sampling for NAIVE mode on XPU.""" | ||
| top_p, top_k, topp_seed = padding_sampling_params( | ||
| sampling_metadata.top_p, | ||
| sampling_metadata.top_k, | ||
| sampling_metadata.seed, | ||
| paddle.reshape(share_inputs["seq_lens_this_time"], shape=[-1]), | ||
| paddle.reshape(share_inputs["seq_lens_encoder"], shape=[-1]), | ||
| ) | ||
| _, next_tokens = top_k_top_p_sampling( | ||
| probs, | ||
| top_p=top_p, | ||
| top_k=top_k, | ||
| top_p=sampling_metadata.top_p, | ||
| top_k=sampling_metadata.top_k, | ||
| top_k_list=sampling_metadata.top_k_list, | ||
| topp_seed=topp_seed, | ||
| topp_seed=sampling_metadata.topp_seed, | ||
| ) |
```python
share_inputs["seq_lens_this_time"],
share_inputs["seq_lens_encoder"],
token_num_output_cpu=int(share_inputs["cu_seqlens_q_output"][-1]),
increment_value=increment_value,
)
```
```diff
 # 7. Updata 'infer_seed' and step_paddle()
-self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
-self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
+if not self.speculative_decoding:
+    self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
+    self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
```
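A worked example of why the speculative path must skip the Python-side update (values are illustrative; num_speculative_tokens is assumed to be 3, and the modulus stands in for the kernel's BUILD_SAMPLING_MAX_INFER_SEED):

```python
MAX_INFER_SEED = 2**63 - 2  # assumption: the actual constant may differ
seed = 100
num_speculative_tokens = 3
kernel_step = (num_speculative_tokens + 1) * 4  # already applied inside build_sampling_params
seed = (seed + kernel_step) % MAX_INFER_SEED
assert seed == 116  # a second add_() in Python would double-advance the seed to 132
```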
```cpp
PD_BUILD_STATIC_OP(build_sampling_params)
    .Inputs({"top_p",
             "top_k",
             "infer_seed",
             "seq_lens_this_time",
             "seq_lens_encoder"})
    .Outputs({"top_p_padding", "top_k_padding", "topp_seed"})
    .Attrs({"token_num_output_cpu: int64_t", "increment_value: int64_t"})
    .SetKernelFn(PD_KERNEL(BuildSamplingParams))
    .SetInferShapeFn(PD_INFER_SHAPE(BuildSamplingParamsInferShape))
    .SetInferDtypeFn(PD_INFER_DTYPE(BuildSamplingParamsInferDtype));
```
| """ | ||
| Unit tests for build_sampling_params XPU op. | ||
|
|
||
| Verifies that the XPU kernel produces the same output as the Python reference | ||
| implementation (padding_sampling_params) for all cases: | ||
| - pure decoder batches (seq_lens_encoder == 0) | ||
| - pure encoder batches (seq_lens_encoder > 0) | ||
| - mixed encoder/decoder batches | ||
| - single-item batch (bs=1) | ||
| - seed wrap-around near MAX_INFER_SEED | ||
| """ |
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-12 14:22:53
📋 Review summary
PR overview: replaces the Python implementation of padding_sampling_params on XPU with the XPU kernel build_sampling_params, moves the infer_seed update into the kernel, and aligns the increment_value stepping with GPU (notably for speculative decoding).
Change scope: custom_ops/xpu_ops/ (new kernel + wrapper + op registration), fastdeploy/model_executor/layers/sample/sampler.py, fastdeploy/worker/xpu_model_runner.py
Impact tags: [XPU] [OP]
📝 PR convention check
The title carries the valid tag [XPU] and is well formed, but the ## Modifications and ## Usage or Command sections are empty (template comments only) and no Checklist item is ticked.
Suggested title (copy-paste ready):
[XPU][OP] Add build_sampling_params XPU kernel to replace Python padding_sampling_params
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Replace the pure-Python `padding_sampling_params` implementation on XPU with the XPU kernel `build_sampling_params` to reduce host-device synchronization overhead. Additionally, move the `infer_seed` update into the kernel and align the increment_value stepping with the GPU implementation (non-speculative: 4, speculative: (num_speculative_tokens + 1) * 4) instead of XPU's previous fixed value of 4.
## Modifications
- New XPU kernel: `custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu`, implementing top_p/top_k/seed padding and the in-place infer_seed update
- New C++ wrapper: `custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp`, with both CPU and XPU3 execution paths
- New Paddle custom op registration: `custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc`
- Declare the `build_sampling_params` interface in `plugin.h`
- `sampler.py`: replace `padding_sampling_params` with `build_sampling_params` in `_verify_and_sample_xpu`; `_normal_sample_xpu` uses `sampling_metadata` fields directly
- `xpu_model_runner.py`: compute `increment_value`; keep updating `infer_seed` in Python for non-speculative decoding, and let the kernel update it for speculative decoding
- New unit test: `custom_ops/xpu_ops/test/test_build_sampling_params.py`, covering pure-decoder, pure-encoder, mixed, single-batch, and seed wrap-around cases
## Usage or Command
N/A (internal implementation swap; the external interface is unchanged)
## Accuracy Tests
INT64 modulo inside the XPU kernel verified (see screenshot in the PR); accuracy matches the original Python implementation.
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/model_executor/layers/sample/sampler.py:1080 | sampling_metadata.topp_seed does not exist; NAIVE-mode XPU sampling will crash at runtime |
Overall assessment
The new XPU kernel is cleanly designed, the CPU/XPU3 dual path is complete, and the unit tests are thorough. However, _normal_sample_xpu uses the nonexistent attribute sampling_metadata.topp_seed (SamplingMetadata only has a seed field), which will crash the non-speculative XPU path at runtime; this must be fixed before merging.
```diff
     top_k=sampling_metadata.top_k,
     top_k_list=sampling_metadata.top_k_list,
-    topp_seed=topp_seed,
+    topp_seed=sampling_metadata.topp_seed,
```
🔴 Bug: sampling_metadata.topp_seed does not exist and will raise an AttributeError.
The SamplingMetadata dataclass (meta_data.py) only has a seed field, not topp_seed. When XPU NAIVE mode (non-speculative decoding) calls _normal_sample_xpu, this line raises AttributeError: 'SamplingMetadata' object has no attribute 'topp_seed' at runtime.
Suggested fixes (pick one):
- Use the existing seed field directly (but check whether seed is already in padded form): topp_seed=sampling_metadata.seed,
- If a padded seed is needed, add topp_seed: Optional[paddle.Tensor] = None to SamplingMetadata and fill it at the SamplingMetadata(...) construction site in xpu_model_runner.py via padding_sampling_params (or the new build_sampling_params kernel).
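A minimal sketch of the second option, assuming SamplingMetadata is a dataclass; all fields except seed are placeholders for whatever meta_data.py actually defines:

```python
from dataclasses import dataclass
from typing import Optional

import paddle

@dataclass
class SamplingMetadata:
    top_p: paddle.Tensor
    top_k: paddle.Tensor
    seed: paddle.Tensor
    # New: padded per-token seed, filled on XPU at construction time by
    # padding_sampling_params or the new build_sampling_params kernel.
    topp_seed: Optional[paddle.Tensor] = None
```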
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-13 14:35:06
📋 Review summary
PR overview: replaces the Python implementation of padding_sampling_params on XPU with the XPU kernel build_sampling_params, moves the infer_seed update into the kernel (speculative path), and aligns the stepping value with the GPU.
Change scope: custom_ops/xpu_ops/ (new kernel/wrapper/op registration), sampler.py, xpu_model_runner.py
Impact tags: [XPU] [OP]
📝 PR convention check
The title carries the [XPU] tag and is compliant, but the ## Modifications and ## Usage or Command sections are empty (template comments only) and no ## Checklist item is ticked; please complete them per the template below.
Suggested title (copy-paste ready):
[XPU][OP] Add build_sampling_params XPU kernel to replace padding_sampling_params
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Replace the Python `padding_sampling_params` implementation on XPU with the XPU kernel `build_sampling_params`, performing the sampling-parameter padding inside the kernel. Also move the `infer_seed` update into `build_sampling_params` (speculative decoding path) and align the `increment_value` stepping of `infer_seed` with the GPU implementation (a stride of 4 per token).
## Modifications
- `custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu`: new XPU3 kernel supporting mixed decoder/encoder batches; cluster 0 updates `infer_seed` in place
- `custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp`: new CPU reference implementation and XPU3 wrapper
- `custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc`: register the custom op via `PD_BUILD_STATIC_OP`, outputting `top_p_padding`, `top_k_padding`, `topp_seed`
- `custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h`: declare the `build_sampling_params` interface
- `fastdeploy/model_executor/layers/sample/sampler.py`: `_verify_and_sample_xpu` replaces `padding_sampling_params` with `build_sampling_params`; `_normal_sample_xpu` uses `sampling_metadata.topp_seed` directly; new `increment_value` parameter for `forward_xpu` / `_verify_and_sample_xpu`
- `fastdeploy/worker/xpu_model_runner.py`: new `self.increment_value` (non-speculative = 4, speculative = (N+1)*4); the speculative path drops the external `infer_seed.add_()` update in favor of the kernel-internal one
- New unit test: `custom_ops/xpu_ops/test/test_build_sampling_params.py` (6 cases: pure decoder, pure encoder, mixed, single batch, seed wrap-around, etc.)
## Usage or Command
N/A
## Accuracy Tests
INT64 modulo inside the XPU kernel verified (see the screenshot attached to the PR).
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc:28 | infer_seed is modified in place through a non-const reference but is only declared under .Inputs in PD_BUILD_STATIC_OP, with no SetInplaceMap |
| 🟡 Suggestion | custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp:145 | The wrapper only supports XPU3; other generations (e.g. XPU2) hit WRAPPER_UNIMPLEMENTED at runtime |
| ❓ Question | custom_ops/xpu_ops/ | New .cc / .cpp / .xpu source files, but no setup_ops.py or CMakeLists.txt update appears in the diff; please confirm they are wired into the build |
Overall assessment
The design is clear: sampling-parameter padding and the infer_seed update are pushed down into the XPU kernel, reducing host-side overhead, and the unit tests cover multiple scenarios. Pay attention to the Paddle custom-op framework's inplace declaration semantics and to build-registration completeness.
```cpp
std::vector<paddle::Tensor> BuildSamplingParams(
    const paddle::Tensor& top_p,
    const paddle::Tensor& top_k,
    paddle::Tensor& infer_seed,
```
🟡 Suggestion: infer_seed is passed by non-const reference and updated in place inside the kernel, but PD_BUILD_STATIC_OP declares it as a read-only input via .Inputs({"infer_seed"}) and sets no SetInplaceMap.
Under the Paddle custom-op convention, in-place modification of an input tensor must be declared to the framework through SetInplaceMap; otherwise, in AOT / static-graph scenarios the framework may create a copy of the input, making the infer_seed update invisible to callers (and the unit test's check on seed.numpy() would fail as well).
Suggested change: also list infer_seed under Outputs and declare the inplace mapping:
```cpp
PD_BUILD_STATIC_OP(build_sampling_params)
    .Inputs({"top_p", "top_k", "infer_seed", "seq_lens_this_time", "seq_lens_encoder"})
    .Outputs({"top_p_padding", "top_k_padding", "topp_seed", "infer_seed_updated"})
    .SetInplaceMap({{"infer_seed", "infer_seed_updated"}})
    ...
```
Alternatively, return the updated infer_seed as a fourth output.
```cpp
                        token_num,
                        increment_value);
  } else if (ctx->dev().type() == api::kXPU3) {
    return xpu3_wrapper(ctx,
```
🟡 Suggestion: the wrapper only implements api::kXPU3; any other XPU generation (e.g. XPU2) falls through to the trailing WRAPPER_UNIMPLEMENTED(ctx) and fails at runtime.
If this op targets XPU3 only, add a hardware-generation check at the op-registration layer or the call site so it fails early with a clear message; if XPU2 support is needed later, extend this dispatch with the corresponding branch.
CI report generated from the code below (refreshed every 30 minutes):
1. Task overview: 1 Required task failed.
2. Task status summary
2.1 Required tasks: 6/10 passed
2.2 Optional tasks: 22/26 passed
3. Failure details (required only): Approval: infrastructure (confidence: high)
Root cause / key logs / suggested fix: ask a team member with the required permission to approve this PR. Link: view logs
Motivation
Replace the Python implementation of padding_sampling_params on XPU with the XPU kernel implementation build_sampling_params. In addition, move the infer_seed update into build_sampling_params and align the increment_value stepping of infer_seed with the GPU implementation.
Modifications
Usage or Command
Accuracy Tests
Verified that INT64 modulo inside the XPU kernel behaves correctly.

Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.